MATHEMATICS AND IMPLEMENTATION DETAILS 
OF A BLOCK MINRES ALGORITHM* 



KIRK M. SOODHALTERt 

Abstract. We present a block minimum residual (MINRES) algorithm for symmetric indefinite 
matrices based on an alternate method for constructing the block Lanczos vectors. This method 
allows us to compare performance with the standard MINRES algorithm iteration for iteration, and 
it handles removal of dependent Lanczos vectors more gracefully. We describe both a theoretical 
derivation of the algorithm as well as practical implementation details. Some numerical results are 
shown to illustrate performance on some sample problems. We also present some experiments to 
show how the relationship between right-hand sides affects the performance of this method. 

1. Introduction. We wish to efficiently solve 

AX = B (1.1) 

were A G R"^" is a symmetric, indefinite matrix, and B £ R"^^'. li p — 1, precon- 
ditioned Krylov subspace iterative methods for symmetric systems |11[ I14| have been 
shown to be effective for solving such problem. For the case p > 1, there are several 
options. We can solve for each right-hand side in sequence, ignoring the relationship 
between the systems. In solving this sequence of systems, we can also take advantage 
of the fact that the matrix does not change and employ deflation techniques e.g., 
[TJ [ini Uni HH, or other projection techniques, e.g., [T51 [52], to accelerate conver- 
gence of an iteration on one system by deflating with a subspace generated during 
the previous system's iteration. Moreover, we can solve these systems in parallel, by 
solving each system on a different processor. 

We can also employ block Krylov subspace methods. For solving a linear system 
with multiple right-hands, it is quite natural to describe iterative methods which ex- 
tend the idea of the Krylov subspace to the block setting, in which we have more than 
one starting vector. In the case of symmetric systems, so-called block extensions of 
methods such as conjugate gradients [TT] and the minimum residual method (MIN- 
RES) [13], have been previously described [13]. For the case p = I, block methods 
can also be used as a way to accelerate the convergence of the iteration, where we 
generate random right-hand sides with which to generate a basis for the block Krylov 
subspace over which we minimize. Thus, block methods have utility for all p G N. 

In this paper, we present an implementation of the block MINRES algorithm, 
different than the one presented by O'Leary in, [T^. This implementation is based on 
an alternative method of generating a block Krylov subspace introduced by Ruhe [17] 
and further described (for the non symmetric case) in [THJ Section 6.12]. This algo- 
rithm can be considered a simplification of the algorithm presented in [2] in the case 
that A is symmetric, which extends Ruhe's method to generalize the nonsymmetric 
Lanczos process to the block setting. We present both the theoretical details and the 
practical implementation issues for this algorithm. To our knowledge, this is the first 
paper to provide the inTplementation details of a block minimum residual algorithm 
for symmetric matrices Q 
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In the next section, wc introduce notation and review relevant theory about block 
Krylov subspace methods. In Section |31 we derive a version of the block minimum 
residual method built upon Ruhe's block Lanczos method. In Section 31 wc derive our 
new version of the block MINRES algorithm in detail and describe our implementa- 
tion, built to take advantage of the potential memory savings afforded by the method. 
Special attention is given to how data is stored to keep the scheme as simple as pos- 
sible. Wc also present modifications to our implementation which accommodate the 
occurrence of dependence of the block Krylov subspace basis. In Section [SJ we briefly 
discuss convergence properties of block methods. In Section [SI we present numerical 
results. 

2. Preliminaries. We begin by describing some nomenclature and notation. 
We call a vector with multiple columns, such as B when p > 1, a block vector. If we 
want to indicate the number p of columns, we will identify the vector as being block-p. 
Boldface, upper-case letters will be used to denote matrices, including block vectors. 
Boldface lower-case letters will denote column vectors. We denote the Euclidean norm 
by ||-|| and the Frobenius norm by ||-||^. For a square, nonsingular matrix A, we will 
denote the condition number associated to the 2-norm, k(A) = ||A|| ||A~^||. When 
identifying an equation as a QR-factorization, we will use the convention that the 
right-hand side of the equation is the QR-factorization of the left-hand side of the 
equation. We denote the k x k identity matrix 1^. We also use the Matlab-inspired 
notation to indicate rows i to j of the argument. For a matrix M, we denote its 
range by D?(M). 

We also take a moment to clarify the ambiguity surrounding the word deflation. 
This terms refers both to a class of techniques in which we project a Krylov subspace 
orthogonal to a specially selected subspace and construct approximations over an 
augmented Krylov subspace. This term also refers to the process of eliminating a 
dependent basis vector from a block Krylov subspace. In this work, when we use the 
term deflation^ we mean the process used in augmented Krylov subspace methods. 
We will refer to the latter process simply as removal of dependent basis vectors. 

Much has been written about the solution of linear systems with multiple right- 
hand sides. Various strategies have been presented. We could simply solve the systems 
in sequence, ignoring their relationship. Building upon this strategy, there has been 
extensive discussion about solving in sequence, but using the Krylov subspace infor- 
mation from one system to accelerate the solution of the next, either by deflated or 
recycled Krylov subspace methods [iSl [HI [20] . Block Krylov subspace methods have 
been proposed in the symmetric case |12| 1131 "^l ^-nd the nonsymmetric case, e.g., 
[HI [IB]- In many cases, extending the framework of a Krylov method to the block 
right-hand side setting involves generalizing the machinery of, e.g., the Arnoldi pro- 
cess to deal with block vectors, see, for example, [191 Page 208]. We reproduce the 
algorithm here for convenience as Algorithm 12.11 which, at step j, generates a block 
Krylov subspace 



Kj(A,Vi) =aCj(A,v(')) 



+ 



aCj(A,vf^) + --- + D^,(A,vJ^^) 



spanned by the columns of 



W, = [Vi, V2, 




where 



v^P^l for£=l,...,j. 
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Algorithm 2.1: Block Arnoldi Method 



Input : A e C"""", Vi e C"'''', VlVi = Ip _ _ 

Output: An orthonormal basis for 3<;,„(A, Vi) and G (D'^+i^p^^p, lias 
p lower subdiagonal entries 

1 for j = 1, 2, . . . m do 

2 Compute W = AVj 

3 for i = 1,2, . . . , j do 

4 H,.j ^ Vf W 

5 W ^ W V, H, 



Compute the QR-factorization W = Vj+iHj+i j 



Normalization in the case of a single vector is generalized to the computation 
of an orthonormal basis for the column space of a block vector. For clarity, we will 
refer to methods that perform operations at the level of a block of vectors as block- 
level Krylov methods. Let Hj = {ii-i^t) be generated by the block Arnoldi process 
and Hi_£ £ (D^^p. As a consequence of the block normalization, G (^nxp j^g^g 
orthonormal columns and the columns in each block are orthonormal to the columns 
of the other blocks; in other words, WjWj = Ij. Let Ej be the matrix that contains 
the last p columns of the Ijp. As in the single vector case, we have a block Arnoldi 
relation 



AW, =W,H,+V,+iH,+i,,E*. 

To derive a different MINRES algorithm, we will use the block Arnoldi algorithm 
proposed by Ruhe [T7] and described in [T21 Page 209]. For this method, we adopt 
the notation that Up = [ui, . . . , Up] denotes the normalized starting block of vectors. 
Note that Up was called Vi when describing the block-level method. Essentially, 
Ruhe's method performs the same orthogonalization as the true block method, only 
one vector at a time. Therefore, at each iteration of Ruhe's block Arnoldi method, 
one matrix-single-vcctor product is performed as opposed to a matrix-block-vector 
product in a block-level method. For review, wc present Ruhe's block Arnoldi process 
as Algorithm 12.21 as described in [191 Page 209]. 

Algorithm 2.2: Ruhe's Block Arnoldi Method 
Input : A G C"^", Up G C"^p, U;Up = Ip 

Output: Up+j- G (D"x(™+P), U;+^-Up+j = Ip+j- and H^- G C^^+p^^^ has p 
lower subdiagonal entries 

1 for j = p,p + 1, . . . ,j + p — 1 do 

2 Set fc := j — p + 1 

3 Compute w := Aufe for i = 1 : j do 

4 hi,k := U*w 

5 _ W W - hi^kUi 

6 Compute /ij+i.fe := ||w||2 and u^+i := w//ij+i.fc 



We can observe that at iteration j, where j = £p is a multiple of the block size. 
Algorithm 12.21 has produced the same orthonormal basis as Algorithm 12.11 at step i. 
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We also have Ruhe's block Arnoldi relation 



AU, =U,+pH,. (2.1) 
The block Hessenberg matrix Hj has p lower subdiagonal bands and has the structure 

H, 



■J 

H 



pxj. 



where Hj is a square j x j matrix which is banded lower subdiagonal with p bands 
satisfying the identity 

H,=U*AU,. (2.2) 
Observe that Hpx j only has nonzero entries in the last p columns with structure 

Hpxj = [Opx(rn— p) Cljj 

where Cj G (DJ'^p is upper triangular. At step j, we will use the following notation 
to describe the space we have constructed. We can always write j = kp + m where 
< j < p. For the triple {j,k^m), j indicates the current iteration, k indicates 
the beginning dimension of the constituent Krylov subspaces, and m indicates which 
subspace was increased from dimension k to dimension fc + 1 at iteration j. To 
illustrate, the subspace that has been generated at step j is 

Kj,fe,„(A,Up) = 

JCfc+i(A,Ui) + ---+ J^fc+i(A,u,„) + JCfc(A,u,„+i) +--- + 3^fc(A,Up) (2.3) 

+ 1 dim. at iter, j +1 dim. at next iter. 

Similar to the Hermitian Lanczos relation in the case of a single- vector Krylov method, 
observe that if A is Hermitian, the relation (|2.2p implies that Hj is also Hermitian. 
Due to the banded lower subdiagonal structure of Hj, we see that Hj is 2p+ 1 banded 
matrix with p superdiagonal entries and p subdiagonal entries. This structure implies 
that we only need the previous 2p basis vectors of Kj,fc,m(A, Up) in order to compute 
Uj+i . We have the 2p + 1 term recurrence relation 

hj+pjUj+i = Auj - hj^iUi (2.4) 

£— min{l,m— p) 

Due to symmetry, we do not need to compute hi_j where i < j since it was 
computed previously as hjj. This will require the storage of the lower subdiagonal 
entries of the last p columns of Hj. This yields Ruhe's block Lanczos method, which 
we present as Algorithm 12.21 

3. A Block Minimum Residual Method. We derive a minimal residual al- 
gorithm based on this version of the Lanczos algorithm. We take a few moments to 
describe what is meant by a block minimum residual algorithm. If we begin with 
an initial guess Xq = 0, at the jth step the following method will produce an ap 
proximation Xj G ([;!"xp such that for each < i < p, the residual 



minimized over the space Kj^/j^„i(A, B). where x^''' is the ith column of X 



b(») - Axf 



IS 



At step j, we want to minimize each colmnn of the block residual 
Fj = B — AXj over Kj_fe_„(A, B). Following the explanation of MINRES presented 
in [5], we can derive a block MINRES algorithm based on Ruhe's block Lanczos pro- 
cess. Let E^"''' g ]R(^+p)^p be the matrix containing the first p columns of 
Observe that 



E 



'Ixp 



(3.1) 



Given an initial residual Fq we can normalize it using the QR factorization Fq 
At step j of Ruhe's block Lanczos process, we have the QR factorization Hj = QjRj 
is such that e (Ci^i+Pl^U+p) is unitary, and e (C^^+P^^i is upper triangular. 
The matrix Rj has a simple block structure, 



R7 



R7 



where Rj is a square, upper triangular, j x j matrix. Let f -^^ be the ith column of 



Fj, the jth block residual. The minimization of 



can be rewritten as 



xex^ 
min 

min 

min 

min 

min 

yeRJ 



mm 

+K,,fc.„(A,B) 

f«-AU,y 



b^*) - Ax 



(3.2) 



f(0 

^0 

(0 



(i) 



AU,y 



(i) 



The solution of this least squares problem can be computed by solving the normal 
equations. 



F(i) 



Ai) 



(i) 



(3.3) 



(i) 
3+P 



(i) 



Ai) 



(i) 



We can solve the normal equations individually for each right-hand side, or we can 
solve for all right-hand sides simultaneously, i.e., = Rj^(Q*Ej"''s)i:j, where S 



replaces the individual initial residual norm 
minimum residual approximation at step j is 



Xo 



in the block setting. The block 

(3.4) 



Xo + u,R7^(q*e1^')s)i:, 
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Let Tj = Q*E*^^S. Wc define our search directions as the columns of ~ U^R^- 
and observe that the columns of Mj successively span the same subspaces as the 
columns of XJj due to the upper triangular structure of ■ We denote block vector 
of search direction coordinates Tj = 



4. Block MINRES for Symmetric Linear Systems. To obtain a storage- 
efficient block MINRES algorithm based on Ruhe's block Lanczos method, we must 
discuss the structure of Rj . This matrix is the upper jxj block of Rj which is obtained 
from the QR- factorization of . We can obtain this factorization column- by-column 
using Givens rotations to annihilate the subdiagonal entries of each column. Recall 
that for the column pair 



to annihilate /i^+ij-, we can construct the Givens sine/cosine pair 



and Ci+ij 



Annihilating /li+ij can now be accomplished by premultiplication with a unitary 
matrix, 



In column j, the Givens rotations annihilating the lower subdiagonal entries will affect 
rows j to J +p. Since Hj has p superdiagonal entries, the Givens rotations associated 
to column j will add no more than p additional subdiagonal entries to any other 
column. Thus Rj is upper triangular with {2p-\- l)-bands. The jth column of Rj has 
the following structure, denoted r^"*-* , 



So) 
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Wc have that M, 



VjRj^ so that MjRj 



Vj, and this gives us the relationship 



between the block Lanczos vectors and the search directions, 



ri^iHii = Vl 
ri,2mi + r2,2m2 = V2 



(4.1) 



ri,2p+imi + r2,2p+im2 
r2,2p+2m2 + r3,2p+2m3 



7'2p+l,2p+ini2p+l — V2p+1 
7'2p+2,2p+2ni2p+2 = V2p+2 



?-j-2pjmj_2p + r'j_2p+i,jmj_2p+i + • • • + rj^jinj = Vj. 

Thus, to compute nij, we need Vj and the 2p previous search directions. 

Since Rj results from the QR- factorization of Hj, we can use Givens rotations 
to annihilate the lower subdiagonal entries of Hj . This procedure is a generalization 
of the one for MINRES; however, since there are p lower subdiagonal entries, each 
columns requires p Givens rotations. At iteration j, let sj"'^ denote the the unitary 
matrix used to annihilate the lower subdiagonal entries of column j of Hj . We write 



G.'^i ^G)_lo o • • • G,"j:„ o where G^^ is the matrix applying the Givens rotation 



3+P-J 



to annihilate entry hi_j from Hj. The application of the Givens rotations in will 
affect columns j to j + 2p\ so, at step j, the rotations in 9j"'^2p '^''j-i ^^^^ affect the 
jth column of Hj after it is constructed, prior to computing the Givens rotations for 
step j. 

We can also use Householder reflections which require about half the work of 
using Givens rotations. It has been shown that we can effectively store products of a 
series of Householder reflections as a single matrix, applying them at once, without 
sacrificing accuracy [TO]. Based on the implementation details discussed in Section HTT] 
below, using Householder reflections here would be straightforward. 

Observe that at every step, we construct a new set of Givens rotations. At step j, 



g^^' is immediately applied to ^^-^ ■ ■ ■ SY'^E^'S. Initially E^^-'S is upper triangular. 
Multiplying by the first set of Givens rotations adds a subdiagonal band. From (|3.ip . 



(i)-rp(i)< 



we see that E^-* ^■'S is a submatrix of E^-'''S. Furthermore, Sj"L/'' 



■si- 



O-i)j^(i-i) 



>S is 



contained as the upper block in S 



gO)gO)g^ Therefore, the first subdiagonal in 



9i^''e^^-'S exists in g^^''E^^''S. Each further application of Givens rotations adds a new 



subdiagonal so that S 



. g(i)jiO)g Yig,s j lower subdiagonal bands. The rotations 



contained in S, only effect rows j to j 



pof 



■ Si'^e[^'^S. Thus we have the 



relation Tj = 
update of Xj_ 



where tj g C, and we can update Xj progressively as an 



^3 — ^J- 



m,tj. 



(4.2) 



In [19], Saad observes that, as in the single right-hand-side case, a computed 
residual (also sometimes called the recursive residual) is available. 



(4.3) 
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4.1. Implementation Details. Wc conclude our description witli some notes 
about practical implementation details. To summarize our storage needs, we must 
store 2p block Lanczos vectors, 2p search directions, 2p sets of Givens rotations (or 
Householder reflections), the lower subdiagonal entries of p previous columns, and 
the jth column of Hj which will be transformed into the jth column of Rj. The 
fact that A is symmetric allows for large savings in the storage requirements of the 
block MINRES algorithm, but this also destroys much of the natural indexing that 
would be available if complete storage were necessary. We want the algorithm to deal 
with as few special cases as possible. With this in mind, we present data structures 
which allow for simple, efficient local indexing from which some simple patterns arise, 
allowing for a clean, simple algorithm with only one special case for the first iteration. 
We use block size p = 5 when presenting examples demonstrating the utility of the 
strategy. Also, we denote the first-in-first-out (FIFO) queue data structure simply as 
a queue, for convenience. 

Storing basis vectors and search direction in queues offers many advantages. In 
Matlab, this structure does not exist. Thus, we will store information in an array, 
column-wise and treat it like a queue. A column in the array will represent data 
generated at a particular iteration. Any initial data is stored in the rightmost columns, 
and when a new column of information is created, it is inserted into the last column 
with all other data being shifted to the left by one column. The first column of data 
is discarded as new data is added to the last column. This means that in any data 
structure, the last column is associated to the current iteration (the most current 
information) allowing for local indexing, i.e., indexing relative to the most recently 
created columnQ 

We must store 2p block Lanczos vectors, and the most straightforward structure 
to store them in is a n x 2p array called V. At the beginning, V is initialized with 
all zeros, and the columns of Up are stored in the last p columns of V. Aside from 
providing a natural way to create and discard Lanczos vectors without explicitly 
keeping track of indices; with this setup, we always know that the vector in the 
{p+ l)st column is the one on which A acts to generate the next Lanczos vector. The 
2p search directions are stored in a similar structure called M, and M is initialized 
to all zeros. 

For the matrices Hj and Rj, we only need to store the complete set of nonzero 
entries of the jth column. The jth column of Hj has at most 2p + 1 nonzero entries. 
We store them in an array called r which is large enough to accommodate padding at 
the top by zeros which will be filled when the Givens rotations from previous steps are 
applied. While we do not need to store all the entries of the previous columns of Hj or 
Rj , we do need to store the lower subdiagonal entries of the p most recent columns of 
Hj , since the block Lanczos method uses the symmetry of Hj to fill the superdiagonal 
entries of the newest column. We store these entries in a p x p array called H where 
each column of H contains the lower subdiagonal entries of a particular column of 
Hj. In the beginning, H is initialized as all zeros, and the columns of H are filled 
starting from the right, functioning as a queue, as we have already described. Storing 
the entries in this manner results in the nonzero superdiagonal entries of the current 



tit should be noted that to achieve this queue-like behavior without actually moving columns 
of data (an expensive memory movement procedure), we store a set of integer indices which act as 
pointers to the columns of the matrix. These indices are permuted rather than the actual columns. 
For example, if the ith entry in the index list is k, then column i of the array is actually stored in 
the fcth column of the data structure. 
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column of Hj being available as the nonzero antidiagonal entries of H. This allows 
us to obtain the super diagonal entries of the current column without computing the 
associated inner products. For example, suppose j — 7, so we are constructing the 
7th column of H,-. Then we would have 



and H = 



^3,2 


^4,3 


h5,4 


he,5 


h7,6 


h4,2 


^5,3 


^6,4 


h7,5 


^8,6 


^5,2 


^6,3 


h7,4 


^8,5 


^9,6 


h6,2 


h7,3 


^8,4 


^9,5 


^10,6 


h7,2 


^8,3 


hg,4: 


^10,5 


^11,6 



(4.4) 





h2,7 
h3,7 
h4,7 
h5,7 
h6,7 
h7,7 
^8,7 
^9,7 
'^10,7 
/ill, 7 
^112, 7 
^113, 7 
'll4,7 

Observe in (|4.4p that the bold entries in r can be obtained from the antidiagonal entries 
of H, also bolded. Though these entries are computed implicitly due to symmetry, 
one may want to explicitly compute them for increased numerical stability. This is of 
particular concern in the computation of eigenvalues, but it is still worth mentioning 
in this context. 

Lastly, the Givens sine and cosines are stored in px 2p arrays s and c. Each column 
of s and c represents the sines and cosines of Givens rotations used to annihilate entries 
of a particular column Hj . We present Algorithm 14. 1[ which describes the Ruhe 
version of block MINRES, but without the implementation tricks just described. 

4.1.1. Removal of Dependent Basis Vectors. In the single-vector GMRES 
minimum residual method (and thus the MINRES method), it may happen that 
at step j, we have that Avj G Kj(A,ro). This situation is referred to as happy 
breakdown since it means that the true solution is contained in the existing Krylov 
subspace, see, e.g., [I9l Section 6.5.4]. Unfortunately, this is not the case in a block 
Krylov subspace method. We may have dependent block Krylov subspace basis vectors 
without convergence any of the systems. In the case of algorithms which are built 
upon the symmetric or nonsymmetric block Lanczos methods, this dependence of the 
basis can lead to unstable algorithms if not properly handled; see, e.g., [TlfTS]. 

Various strategies have been proposed to mitigate the basis dependence problem. 
For block- level algorithms, one must first compute or estimate the range of the block 
Krylov subspace basis to detect rank deficiency. For symmetric Lanczos-based meth- 
ods, O'Leary [13] advocates removal of the dependent vector, reducing the block size. 
The update procedures for the systems not associated to the removed vector do not 
change, and a progressive update formula can be derived for the systems associated to 
removed right-hand sides. Baglama suggests that instead of simply removing the 
dependent vector and reducing block size, one can instead replace the dependent vec- 
tor with a random vector which has been orthgonalized against all previous Lanczos 
vectors and continue unabated. For nonsymmetric Lanczos-based block QMR, Aliaga 
et al. [5] propose to remove basis vectors before dependence is detected. Due to is- 
sues of stability in block nonsymmetric Lanczos based methods, the authors advocate 



defining a tolerance dtd > 0. After a vector v has been biorthogonafized we have 



Algorithm 4.1: Block MINRES (Rulic Implementation) 



, e > 0, M G N 

\\<eVj<p 



Input : A e C"^" Symmetric, B e C"''p, Xq = 
Output: X e such that 

||B(:,j)-AX(:,j-)||/||B(:,jVAXo(:,j; 

1 Compute the QR- Factorization BVpS 

2 s ^ se[^^ 

3 X^Xo 

4 R ^ B AX 

5 while maxo<i<p | (Tj*'')j+i:j+j } ^ ll'^'"'''!! ^^"^ — 



Av, 
if j > 1 then 

ior i = j -p: j 



1 do 



for 2 = j : j + p — 1 do 

w ^ hijVi 
^j+Pj = l|w|| 

if j > 1 then 



L 



nU) 



for 2 = j + p — 1 : j do 



Generate Givens rotation for GY_^i j to annihilate /ij+i j 



i+i.j 
if m = i then 
\_ mi =vi/Ri(l,l) 

else 

w ^ Vj- 

for j = j — 2p : j — 1 do 
|_ w ^ vi^ - Rj (i, j)mi 

mj = w/RjO-,j) 

t^^S(j,:) 
X 

s 



X 

s 

j • 



m,t^ 



Olxp 

J + 1 
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||v|| < dtoi, then we consider v as almost being dependent, and it is removed from 
the basis. In [5], a bookkeeping scheme is presented to keep track of such removals 
so that the block QMR algorithm can be adjusted accordingly. Recently, this tech- 
nique was extended to a block conjugate gradient method for shifted linear systems 
[4]. The book keeping scheme allows for the dependent basis vectors to be removed 
from the block Lanczos process but temporarily retained in memory for the purposes 
of orthogonalization. 

Dubrulle [6] proposes an alternative to the removal of dependent or near-dependent 
basis vectors, for use in a block conjugate gradient algorithm. He proposes to use a 
change-of-basis strategy for the block descent directions and other algorithmic changes 
to avoid the problem long before near-rank deficiency of the block basis vectors occurs. 
This additionally avoids the need for basis rank estimation. 

For our version of the block MINRES algorithm, we act only when an actual 
dependency occurs (where dependency has the numerical definition that /ij+pj < 7 
for a pre chosen < 7 ^ 1). One of the strengths of a block MINRES algorithm built 
from Ruhe's block Lanczos method is that there is no need for any basis rank estima- 
tion. Since we construct only one block Lanzcos vector at a time, we simply need to 
compute hj^pj, i.e., compute the norm of the newest basis vector after orthogonal- 
ization via Ruhe's block Lanzos process. This observation has been previously made 
(in the context of eigenvalue computations) [3]. Baglama presents two options for 
dealing with basis dependence. One option is to reduce block size by one and adjust 
short-term recurrences accordingly. The other is to generate a random vector and 
orthoganalize it with respect to all previous Lanczos vectors. This normalized vector 
is then put in the place of the dependent basis vector. We show that either option 
results in minimal changes to either Algorithm 14 . 1 1 or the implementation details pre- 
sented in the previous section. We discuss the algorithmic modifications required to 
incorporate both options into our block MINRES algorithm, but we choose the latter 
due the simplicity of incorporating it into the algorithm. For illustration, we present 
examples with smaller block size than previously, for ease of discussion; however, it is 
clear that the simplifications presented do not change if the block size increases. 

We discuss, first, the effects of choosing the first option, block size reduction. 
Suppose we have block size p = 2, and we run ten iterations of the Ruhe block 
Lanczos algorithm, such that no dependent basis vectors arise. We have generated 
block Krylov subspace Ki2,6,o(A, U2). Implicitly, we have also generated 
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(4.5) 



where Hio is symmetric, and we have AUiq = U12H10. Now, suppose that instead 
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Au5 e Kg 3 o(A,U2), numerically. This means that 



AU5 = ft.7,5U7 + /l6,5U6 + /l5,5U5 + /l4,5U4 + /l3,5U3 



(4.6) 



with /i7 5 < 7. Then we take Uy = 0, do not add it to our basis, and assign 5 = 0. 
Furthermore, since this implies that AU7 = 0, we have that Ug = 0, Un = 0, etc. 
Thus, certain entries of Hio will be annihilated. Specifically, columns seven and 
nine are now zero. By symmetry, nonzero entries in rows seven and nine are also 
annihilated, and the same is true for row eleven. With this in mind, we can actually 
construct a smaller, banded matrix and write a more compact block Lanczos relation 
which ignores the eliminated Lanczos vectors. Note that for the purposes of this 
description, we do not renumber the block Lanczos vectors when one is eliminated 
due to the removal of a dependent basis vector. We maintain the same indexing as 
before, meaning indices associated to annihilated vectors are no longer used. Let 
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(4.7) 



so that we have the compacted block Lanczos relation AUio = U12H10, where 

U12 = [Ui U2 • • • Ug Us Uio U12] , 



and Uio is similarly defined but without the last column. Notice that this removal of 
one dependent basis vector results in a reduction of the effective bandwidth by two 
when we switch from Hio to Hiq. However, this reduction happens in two stages. 
First, bandwidth reduces once in column five. Then in column eight (actually the 
seventh column) there is a second reduction. This can easily be generalized. For block 
size p, if we observe that Au^ is contained in an old Krylov subspace and therefore 
0, then we will see a reduction of bandwidth at column i and another reduction 



u, 



at column i+p. This implies a reduction in the number of vectors required for storage 
in the block Lanczos process. We need two fewer vectors for each block size reduction. 

This change in bandwidth is refiected in the upper triangular Jjands of Rio, the 
upper triangular matrix defined by the QR factorization Hio = QioR-io- We have 
the structure 
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(4.8) 



12 



with boldcd zeros indicating where an entry was annihilated by the reduction in block 
size. Recalling (|4.ip , this indicates that our storage requirements for constructing the 
search directions will change as the bandwidth changes. To construct ms, we need to 
store m4, ma, m.2., and mi along with U5. However, to construct mg, we only need to 
store four vectors: ms, 1114, 1113, and Ug. For mio, we only require three vectors: mg, 
my, and Uiq. This can easily be generalized for block-p. For each dependent basis 
vector removed, we will have a two-vector reduction in storage requirements for the 
construction of the search directions. In total, for each block size reduction, we have 
a four-vector reduction in storage requirements. 

We now discuss how choosing the second option (inserting a random, orthog- 
onalized vector into the basis) affects the algorithm. We begin by describing the 
replacement procedure in more detail. At iteration j, we compute Auj. After orthog- 
onalization, we sec that hj+pj < 7. Thus, Auj is (numerically) in the range of the 
previous Lanczos vectors. We set hj+p^j = thus the coefficients in column j encode 
that AvLj G span {ui, . . . , Uj}. Let Uj+p be a vector constructed by taking a random 
vector w, orthogonalizing w with respect to all previous Lanczos vectors, and setting 
Uj+p = w/ ||w|[. Then the algorithm continues as before with this modified block 
Lanczos basis. 

What modifications must be made to the block MINRES algorithm to accommo- 
date this strategy? It turns out, very few. Of course, we do not store the complete 
Lanczos basis, as this would defeat the purpose of developing a method for symmetric 
systems. However, we need to orthogonalizc the random vector against the entire ba- 
sis. As a work-around, we can generate a random vector at the start of the iteration 
and simply orthogonalizc against each Lanczos vector as it is created. This would 
require only one additional vector of storage and an additional orthogonalization per 
iteration. If we arc solving a problem in which we expect there to be more than one 
occurrence of loss of linear independence, we can generate more than one random 
vector, balancing between increasing the storage requirements and insuring against 
the basis dependence problem. 

One might be concerned that introducing a vector not created by the block Lanz- 
cos process will destroy the short-term recurrences which make symmetric Lanczos 
methods so attractive. However, this is not the case. Suppose that after iteration j, we 
continue Ruhc's block Lanczos process with the modified basis. Let U^+p e ©"^(-J+p) 
be the matrix containing the block Lanczos vectors but with Uj+p as its last column. 
Observe that the matrix U*_|_pAUj+p is symmetric; inserting the new basis vector 

does not effect this. Thus, the banded structure of Hj defined by AUj = U^+pHj is 
the same as that of Hj . The only change is that we now have zero entries at hj+p,j and 
hjj^p. This, in turn, gives a slight change in structure to Rj, the upper triangular 
factor in the QR- factorization of Hj . 

As an example, suppose p = 2 and that AV5 is in the span of the existing block 
Lanczos vectors, as in the last example. If we continue the Ruhe's block Lanczos 
process with the modified basis, we have the following structures for Hg, and Rg G 
(j-iiox8^ the upper triangular factor in the QR- factorization of Hg constructed using 
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Givens rotations, 
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This indicates that the final effects of replacing the dependent basis vector with a 
random one are minimal. The two zeros are introduced into upper Hessenberg matrix, 
but the bandwidth and symmetry properties remain unchanged. The introduction of 
a zero in the seventh column of Rg and another in the ninth simply means that the 
seventh and ninth block Lanczos vectors are linear combinations of the previous four 
rather than the previous five search directions, recalling the construction of the search 
directions (|4.1I) . 

5. Convergence Theory. Theoretically, MINRES is a version of block GMRES 
for symmetric systems. Simoncini and Gallopoulos discussed the convergence prop- 
erties of block GMRES [2Tj , including a result by Vital [23] . We can easily describe 
the quality of the residual produced at iteration j = k + mp. For b^*^ , the ith. column 
of the right-hand side B, Algorithm 14. II minimizes the ith column of the residual fj'-* 

over the subspace Kj_fc^„j(A, Fo). Thus, we can expect fj*"* to be at least as good 

as the norm of the residual produced by running J steps of MINRES with b(') as the 

/c if i < m 
k + 1 ii i > m 

by recalling the definition of Kjjt.,„(A, Fq) in p.3p . We observe that having a larger 
subspace over which to minimize is not guaranteed to give dramatic improvements 
in convergence. The additional information contained in Kj,fc,m(A, Fq) may not be 
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single right-hand side, where J 



This easily can be understood 



helpful in the minimization process. For specially related right-hand sides, though, 
we may have convergence in many fewer iterations. 

6. Numerical Results. These numerical experiments are meant to demonstrate 
the effectiveness and behavior of Algorithm 14.11 In all experiments, we compare the 
performance of block MINRES with sequential applications of Matlab's MINRES 
function. We compared performance using iteration counts and CPU timings. All 
tests were performed on a Macbook Pro containing a 2.3 GHz Intel Core 15 processor 
with 8 GB of 1333MIIz DDR3 main memory running the 64-bit version of Matlab 
R2011b. In any experiment involving the generation of random vectors, we used Mat- 
lab's mtl9937ar random number generator, with seed 0, which was initialized at the 
beginning of each experiment. Let L £ (D"^ ^"^ , with ni — 40000, be the discretization 
of the Laplacian operator on a 200 x 200 regular grid using central differences. This 
matrix is negative-definite. Let A = — L -I- 2001. Due to the eigenvalue distribution 
of L, we have that A is indefinite. 

It should be noted that using Ruhe's Lanczos implementation allows us to compare 
with sequential applications of a single-vector method iteration for iteration. This is 
particularly useful if we are using a block method simply to accelerate the convergence 
of an iteration applied to a system with one right-hand side, i.e., we have generated 
a few random artificial right-hand sides to create the block Krylov subspace. 




1 500 1000 1500 2000 2500 3000 3500 
Block MINRES: 1048 iter., 19.882 sec. 
Sequential MINRES: 3815 iter., 22.499 sec. 

Fig. 6.1. Comparison of the performance of Algorithm [4!T1 versus sequential applications of 
MINRES on the discretized Laplacian system with ten randomly generated right-hand sides. The 
solid black curve is actually the ten convergence curves for each right-hand side when solved by 
Alaorithm \A.l\ overlaid on one another. We see that in the case of these ten right-hand sides that 
block MINRES convergence for all ten systems is qualitatively the same. The black dashed curves 
are the convergence curves for each sequential application of MINRES for each right-hand side. 

We begin by demonstrating the performance of the algorithm on the shifted Lapla- 
cian system with ten randomly generate right-hand sides. In Figure I5TT1 we see that 
for these right-hand sides, the block MINRES algorithm offers a performance im- 
provement in terms of iterations and in time. The time improvement is moderate, 
but for larger, more-expensive-to-apply operators, the improvement in time could be 
more substantial. 

We demonstrate that our removal of dependent basis vectors works as described. 
Of course, it is difficult to choose a pair of right-hand sides for which basis dependence 
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will occur in later iterations. Thus, as a simple, easy-to-construct test, we choose the 
first right-hand side ei, as the first canonical basis vector. The second right-hand 
side is Aei, the image of the first canonical basis vector, i.e., the first column of 
our coefficient matrix. This will result in basis dependence at the first iteration of 
our algorithm. As is shown in Figure 16.21 this leads to immediate convergence for 
that system when running block MINRES. Of course, this example is not likely to 
occur in practice. It merely demonstrates that the algorithm can handle dependence 
gracefully. 




Bl. MR ei 

- Bl. MR Aei 



-10 , Scq. MR ei 

]--- Seq. MR Aei 

1 200 



Fig. 6.2. Demonstration of the algorithm's performance in the case that it encounters basis 
dependence. In this case, with the right-hand sides e\ and Aei, dependence occurs at the first 
iteration. Since the first right-hand side is the solution to the second system, we get immediate 
convergence for the second system, and block MINRES continues for the other system, replacing the 
dependent basis vector with a random one. 

We demonstrate how the relationship between the right-hand sides can effect 
the performance of this implementation of block MINRES with three experiments. 

We compared the performance of our block MINRES implementation with that of 
sequential runs of Matlab's MINRES for A with three pairs of right-hand sides. For 
the first pair, let bi = e"n} and b2 = 1, the vector of all ones. For second pair of right- 

(2) 

hand sides, we let bi = bi but change the second right-hand side by letting b2 = e„/. 
In Figure l673l we show a comparison of convergence curves for these pairs of right-hand 
sides. We observe that exchanging b2 for b2 degrades the performance of our Block 
MINRES implementation. Recall that the convergence of a Krylov subspacc method 
for a Hermitian system is completely determined by its eigenvalues. For an indefinite 
system, the eigenvalues closest to the origin cause a delay in convergence. Therefore, 
we hypothesize that a pair of right-hand sides that have strong components from 
different parts of the eigenspace associated to eigenvalues of small magnitude might 
complement each other well. Let Q € (C"i have orthonormal columns spanning the 
eigenspace associated to the 50 eigenvalues of A of smallest magnitude. We can study 
the magnitude of the components of bi, b2, and b2 in !K(Q) to better understand the 
convergence curves we see in Figure 16.31 In Figure 16. 4| we plot the components of 
Q*bi, Q*b2, and Q*b2. We see that the eigencomponents of bi, and b2 are similar 
in magnitude, meaning that a block Krylov subspace for these two right-hand sides 
might not contain subspacc information useful for accelerating convergence, when 
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Block MINRES: 362 iter., 2.2867 sec. Block MINRES: 555 iter., 3.4629 sec. 

Sequential MINRES: 551 iter., 3.2623 sec. Sequential MINRES: 572 iter., 3.3927 sec. 

Fig. 6.3. Performance of block MINRES for different right-hand sides. In the figure on the 
left, the two right-hand sides are bi = e^^ and b2 = 1. In the figure on the right, bi does not 
change, but £12 = e^^^' . 
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Fig. 6.4. Magnitude of the components of different right- hand- sides in the eigenspace associated 
to the fifty eigenvectors associated to the smallest magnitude eigenvalues. 



compared to its single- vector Krylov subspace counterparts. Many of the components 
of b2 in CR(Q) are orders of magnitude smaller than those of bi. Thirteen are orders 
of magnitude larger. This may be part of the reason that we are able to achieve some 
acceleration of convergence using block MINRES for this pair of right-hand-sides. To 
test this theory, we can construct a new right-hand-side b2 = b2 — QQ*b2 4- QQ*b2. 
This has the effect of removing the components of b2 in 3^(Q) and replacing them 
with the components of b2. In Figure l675l we see that when applying block MINRES 
to the pair of right- hand-sides bi and b2, the method is less effective when compared 
to the performance bi and b2 in Figure 15751 

This is by no means a rigorous analysis of convergence of a block method. These 
experiments only are meant to illustrate the variability of performance of a block 
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Block MINRES: 457 iter., 2.9113 sec. 
Sequential MINRES: 555 iter., 3.279 sec. 

Fig. 6.5. Performance of block MINRES after altering some components of the right- 
hand side b2. 



method for different right-hand sides and provide some insight into this phenomenon. 

7. Conclusions. We have presented an alternative implementation of the block 
MINRES residual algorithm. This version is based on Ruhe's block Krylov sub- 
space basis generation strategy which produces one orthonormal basis vector at a 
time rather than a full block of vectors. We provide not only a theoretical deriva- 
tion of the algorithm but also a discussion of the practical implementation issues 
which need to be addressed to fully take advantage of the efficiencies which arise in 
a block method for symmetric systems. This variant of the block MINRES method 
handles dependence of block Krylov subspace basis vectors in a more graceful manner 
then it's block-level brethren. A software implementation in Matlab is provided at 
: //math . soodhalter . com/ software . php 
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